Doublet Detection Tutorial¶

Author: Hui Ma, Yiming Yang, Rimte Rocher
Date: 2022-02-24
Notebook Source: doublet_detection.ipynb

In [1]:
import pegasus as pg

Dataset¶

In this tutorial, we'll use the output result of Pegasus Tutorial to demonstrate how to detect and remove doublet cells in Pegasus. The dataset consists of human bone marrow single cells from 8 donors.

The dataset is stored at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip. You can also use gsutil to download it via its Google bucket URL (gs://terra-featured-workspaces/Cumulus/MantonBM_result.zarr.zip).

Now load the count matrix:

In [2]:
data = pg.read_input("MantonBM_result.zarr.zip")
data
2022-02-24 12:28:55,493 - pegasusio.readwrite - INFO - zarr file 'MantonBM_result.zarr.zip' is loaded.
2022-02-24 12:28:55,495 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.97s.
Out[2]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 35465 x 25653
    Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'X', 'raw.X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'gender', 'n_counts', 'percent_mito', 'scale', 'louvain_labels'(cluster), 'anno'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    obsm: 'X_diffmap', 'X_fle'(basis), 'X_pca', 'X_pca_harmony', 'X_phi', 'X_tsne'(basis), 'X_umap'(basis), 'diffmap_knn_distances'(knn), 'diffmap_knn_indices'(knn), 'pca_harmony_knn_distances'(knn), 'pca_harmony_knn_indices'(knn)
    varm: 'means', 'partial_sum', 'de_res'
    obsp: 'W_diffmap', 'W_pca_harmony'
    uns: 'genome', 'louvain_resolution', 'modality', 'norm_count', 'pca_features', 'stdzn_max_value', 'PCs', 'diffmap_evals', 'ncells', 'stdzn_mean', 'stdzn_std', '_attr2type', 'df_qcplot', 'pca'

Sections¶

  • Detect Doublets
  • Remove Doublets and Recluster

Detect Doublets¶

In this step, infer doublets per channel. Set clust_attr = 'anno' to see the doublet density in each cluster and infer doublet cluster.

The method used for detecting doublets can be found here.

In [3]:
pg.infer_doublets(data, channel_attr = 'Channel', clust_attr = 'anno') 
2022-02-24 12:28:55,676 - pegasus.tools.preprocessing - INFO - After filtration, 21108/25653 genes are kept. Among 21108 genes, 17375 genes are robust.
2022-02-24 12:28:55,676 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.14s.
2022-02-24 12:28:55,748 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-02-24 12:28:55,761 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:28:55,802 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:28:55,803 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-02-24 12:28:58,483 - pegasus.tools.doublet_detection - INFO - Sample MantonBM1_HiSeq_1: doublet threshold = 0.1606; total cells = 4415; neotypic doublet rate in simulation = 44.52%; neotypic doublet rate = 2.22%.
2022-02-24 12:28:59,046 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 3.24s.
2022-02-24 12:28:59,762 - pegasus.tools.preprocessing - INFO - After filtration, 20329/25653 genes are kept. Among 20329 genes, 16739 genes are robust.
2022-02-24 12:28:59,762 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.16s.
2022-02-24 12:28:59,835 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-02-24 12:28:59,847 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:28:59,890 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:28:59,891 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.06s.
2022-02-24 12:29:02,923 - pegasus.tools.doublet_detection - INFO - Sample MantonBM2_HiSeq_1: doublet threshold = 0.1401; total cells = 4935; neotypic doublet rate in simulation = 57.44%; neotypic doublet rate = 3.12%.
2022-02-24 12:29:03,298 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 3.41s.
2022-02-24 12:29:04,022 - pegasus.tools.preprocessing - INFO - After filtration, 20231/25653 genes are kept. Among 20231 genes, 16424 genes are robust.
2022-02-24 12:29:04,023 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.15s.
2022-02-24 12:29:04,090 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-02-24 12:29:04,101 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:29:04,146 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:04,147 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.06s.
2022-02-24 12:29:06,622 - pegasus.tools.doublet_detection - INFO - Sample MantonBM3_HiSeq_1: doublet threshold = 0.1495; total cells = 4225; neotypic doublet rate in simulation = 41.70%; neotypic doublet rate = 1.78%.
2022-02-24 12:29:07,042 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.89s.
2022-02-24 12:29:07,835 - pegasus.tools.preprocessing - INFO - After filtration, 20593/25653 genes are kept. Among 20593 genes, 16941 genes are robust.
2022-02-24 12:29:07,835 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.19s.
2022-02-24 12:29:07,907 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-02-24 12:29:07,918 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:29:07,960 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:07,961 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-02-24 12:29:10,436 - pegasus.tools.doublet_detection - INFO - Sample MantonBM4_HiSeq_1: doublet threshold = 0.1453; total cells = 4172; neotypic doublet rate in simulation = 40.96%; neotypic doublet rate = 1.73%.
2022-02-24 12:29:10,835 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.87s.
2022-02-24 12:29:11,641 - pegasus.tools.preprocessing - INFO - After filtration, 20955/25653 genes are kept. Among 20955 genes, 17374 genes are robust.
2022-02-24 12:29:11,642 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.19s.
2022-02-24 12:29:11,726 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.08s.
2022-02-24 12:29:11,739 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:29:11,783 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:11,783 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.06s.
2022-02-24 12:29:14,126 - pegasus.tools.doublet_detection - INFO - Sample MantonBM5_HiSeq_1: doublet threshold = 0.1651; total cells = 4090; neotypic doublet rate in simulation = 33.70%; neotypic doublet rate = 1.59%.
2022-02-24 12:29:14,528 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 2.74s.
2022-02-24 12:29:15,307 - pegasus.tools.preprocessing - INFO - After filtration, 21035/25653 genes are kept. Among 21035 genes, 17344 genes are robust.
2022-02-24 12:29:15,308 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.23s.
2022-02-24 12:29:15,396 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.09s.
2022-02-24 12:29:15,408 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:29:15,451 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:15,452 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-02-24 12:29:18,300 - pegasus.tools.doublet_detection - INFO - Sample MantonBM6_HiSeq_1: doublet threshold = 0.1871; total cells = 4665; neotypic doublet rate in simulation = 41.07%; neotypic doublet rate = 1.59%.
2022-02-24 12:29:18,698 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 3.25s.
2022-02-24 12:29:19,461 - pegasus.tools.preprocessing - INFO - After filtration, 20449/25653 genes are kept. Among 20449 genes, 16754 genes are robust.
2022-02-24 12:29:19,462 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.22s.
2022-02-24 12:29:19,537 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-02-24 12:29:19,550 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:29:19,592 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:19,593 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.05s.
2022-02-24 12:29:22,204 - pegasus.tools.doublet_detection - INFO - Sample MantonBM7_HiSeq_1: doublet threshold = 0.1420; total cells = 4452; neotypic doublet rate in simulation = 47.79%; neotypic doublet rate = 3.01%.
2022-02-24 12:29:22,623 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 3.03s.
2022-02-24 12:29:23,449 - pegasus.tools.preprocessing - INFO - After filtration, 20070/25653 genes are kept. Among 20070 genes, 16434 genes are robust.
2022-02-24 12:29:23,450 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.20s.
2022-02-24 12:29:23,525 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.07s.
2022-02-24 12:29:23,537 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.01s.
2022-02-24 12:29:23,581 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:23,582 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.06s.
2022-02-24 12:29:26,306 - pegasus.tools.doublet_detection - INFO - Sample MantonBM8_HiSeq_1: doublet threshold = 0.1923; total cells = 4511; neotypic doublet rate in simulation = 34.21%; neotypic doublet rate = 1.15%.
2022-02-24 12:29:26,678 - pegasus.tools.doublet_detection - INFO - Function '_run_scrublet' finished in 3.10s.
2022-02-24 12:29:27,243 - pegasus.tools.doublet_detection - INFO - Doublets are predicted!
2022-02-24 12:29:27,244 - pegasus.tools.doublet_detection - INFO - Function 'infer_doublets' finished in 31.73s.

Here, plot annotation and Scrublet-like doublet score.

In [4]:
pg.scatter(data,attrs=['anno','doublet_score'], basis='umap', wspace=1.2) 

We also want to see the doublet percentage of each cluster to decide if there is a doublet cluster.

In [5]:
data.uns['pred_dbl_cluster']
Out[5]:
cluster percentage pval qval
0 CD14+ Monocyte 4.462044 0.000003 0.000018
1 Natural killer cell 3.307692 0.000013 0.000036
2 B cell 3.277932 0.000004 0.000018

All clusters have doublet percentage under 5%, so no need to mark any doublet clusters here. If any cluster has doublet percentage more than $50\%$, we can consider to mark it as doublet cluster.

For example, If we want to mark 'CD14+ Monocyte' and 'CD14+ Monocyte-2' as doublet clusters, use the following code:

pg.mark_doublets(data, dbl_clusts = 'anno:CD14+ Monocyte,CD14+ Monocyte-2')

The mark_doublets function will mark doublet cluster (if any), and write singlet/doublet assignment to the "demux_type" column attribute in data.obs. The "demux_type" attribute is also used for singlet/doublet assignment of cell hashing, nucleus hashing and genetics pooling data (see documentation).

For this demonstration dataset, among $35,465$ cells, $724$ doublets detected. Doublet rate is $2.04\%$:

In [6]:
pg.mark_doublets(data)
data.obs['demux_type'].value_counts()
Out[6]:
singlet    34741
doublet      724
Name: demux_type, dtype: int64

Doublets distribution can be better observed in UMAP plot:

In [7]:
pg.scatter(data, attrs = ['anno', 'demux_type'], legend_loc = ['on data', 'right margin'], 
           wspace = 0.1,alpha = [1.0, 0.8], palettes = 'demux_type:gainsboro,red')

Remove Doublets and Recluster¶

In [8]:
pg.qc_metrics(data, select_singlets=True)
pg.filter_data(data)
2022-02-24 12:29:38,098 - pegasusio.qc_utils - INFO - After filtration, 34741 out of 35465 cell barcodes are kept in UnimodalData object GRCh38-rna.
2022-02-24 12:29:38,098 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.77s.

Start the reclustering process from re-selecting highly variable genes. Batch effect is observed, so we also want to use harmony algorithm to correct bach effect for reclustering.

In [9]:
pg.highly_variable_features(data, batch='Channel')
pg.pca(data)
pca_key = pg.run_harmony(data)
pg.neighbors(data,rep=pca_key)
pg.louvain(data,rep=pca_key)
pg.umap(data,rep=pca_key)
2022-02-24 12:29:38,288 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.18s.
2022-02-24 12:29:38,332 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 12:29:38,333 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.22s.
2022-02-24 12:29:43,555 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 5.22s.
2022-02-24 12:29:44,252 - pegasus.tools.batch_correction - INFO - Start integration using Harmony.
	Initialization is completed.
	Completed 1 / 10 iteration(s).
	Completed 2 / 10 iteration(s).
	Completed 3 / 10 iteration(s).
	Completed 4 / 10 iteration(s).
	Completed 5 / 10 iteration(s).
	Completed 6 / 10 iteration(s).
Reach convergence after 6 iteration(s).
2022-02-24 12:30:01,166 - pegasus.tools.batch_correction - INFO - Function 'run_harmony' finished in 17.61s.
2022-02-24 12:30:08,742 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 7.57s.
2022-02-24 12:30:09,921 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.18s.
2022-02-24 12:30:11,036 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.05s.
2022-02-24 12:30:33,059 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 16 clusters.
2022-02-24 12:30:33,094 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 23.16s.
2022-02-24 12:30:33,094 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2022-02-24 12:30:33,095 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
UMAP(min_dist=0.5, precomputed_knn=(array([[    0, 29734, 20650, ..., 20908, 32334, 19609],
       [    1, 13768, 10422, ..., 21043, 22195, 5157],
       [    2, 8957, 30941, ..., 33034, 34066, 33654],
       ...,
       [34738, 34456, 30776, ..., 19529, 33571, 16419],
       [34739, 16813, 5464, ..., 27302, 9559, 5461],
       [34740, 8469, 21696, ..., 31917, 15034, 4294]]), array([[0.       , 5.1923103, 5.2694206, ..., 5.560473 , 5.56299  ,
        5.563344 ],
       [0.       , 7.234295 , 7.639039 , ..., 8.320163 , 8.425379 ,
        8.499373 ],
       [0.       , 4.881042 , 4.9392414, ..., 5.1259046, 5.1599846,
        5.1828895],
       ...,
       [0.       , 6.496879 , 7.0165043, ..., 7.630356 , 7.6736407,
        7.6774874],
       [0.       , 7.983525 , 8.119182 , ..., 9.061247 , 9.132658 ,
        9.183697 ],
       [0.       , 5.869001 , 6.701155 , ..., 7.7948093, 7.8501515,
        7.967697 ]], dtype=float32), <pegasus.tools.visualization.DummyNNDescent object at 0x7fb6044aa610>), random_state=0, verbose=True)
Thu Feb 24 12:30:33 2022 Construct fuzzy simplicial set
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Thu Feb 24 12:30:35 2022 Construct embedding
Epochs completed:   0%|            0/200 [00:00]
Thu Feb 24 12:30:59 2022 Finished embedding
2022-02-24 12:30:59,581 - pegasus.tools.visualization - INFO - Function 'umap' finished in 26.49s.

Re-annotate:

In [10]:
pg.de_analysis(data, cluster='louvain_labels')
celltype_dict = pg.infer_cell_types(data, markers = 'human_immune',de_test='mwu',output_file='BM_celltype_re_dict.txt')
cluster_names = pg.infer_cluster_names(celltype_dict)
pg.annotate(data, name='anno', based_on='louvain_labels', anno_dict=cluster_names)
2022-02-24 12:31:00,420 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.8296s.
2022-02-24 12:31:13,352 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 12.9324s.
2022-02-24 12:31:13,470 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.1177s.
2022-02-24 12:31:13,567 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished.
2022-02-24 12:31:13,570 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 13.98s.

Umap of annotation after re-clustering:

In [11]:
pg.scatter(data,attrs='anno',legend_loc='on data',basis='umap')
calc_mwu finished for genes in [6414, 12827).
calc_mwu finished for genes in [12827, 19240).
calc_mwu finished for genes in [0, 6414).
calc_mwu finished for genes in [19240, 25653).